Lab 8.1-Midterm Review and across()

Load the libraries

library("tidyverse")
library("skimr")
library("janitor")
library("palmerpenguins")

Penguins

Recall that: group_by() and summarize() work great together.

penguins %>% 
  group_by(island) %>% 
  summarize(mean_body_mass_g=mean(body_mass_g, na.rm=T)) # remember to remove those NA's!
## # A tibble: 3 × 2
##   island    mean_body_mass_g
##   <fct>                <dbl>
## 1 Biscoe               4716.
## 2 Dream                3713.
## 3 Torgersen            3706.

What if we are interested in the number of observations (penguins) by species and island?

penguins %>% 
  group_by(island, species) %>% 
  summarize(n_penguins=n(), .groups = 'keep')
## # A tibble: 5 × 3
## # Groups:   island, species [5]
##   island    species   n_penguins
##   <fct>     <fct>          <int>
## 1 Biscoe    Adelie            44
## 2 Biscoe    Gentoo           124
## 3 Dream     Adelie            56
## 4 Dream     Chinstrap         68
## 5 Torgersen Adelie            52

Recall that that count() works like a combination of group_by() and summarize() but just shows the number of observations.

penguins %>% 
  count(island, species)
## # A tibble: 5 × 3
##   island    species       n
##   <fct>     <fct>     <int>
## 1 Biscoe    Adelie       44
## 2 Biscoe    Gentoo      124
## 3 Dream     Adelie       56
## 4 Dream     Chinstrap    68
## 5 Torgersen Adelie       52

tabyl() will also produce counts, but the output is different. It is just a matter of personal preference.

penguins %>% 
  tabyl(island, species)  # tabyl is part of the janitor package
##     island Adelie Chinstrap Gentoo
##     Biscoe     44         0    124
##      Dream     56        68      0
##  Torgersen     52         0      0

across()

Across allows us to usefilter() and select() across multiple variables.

Here we are summarizing the mean of all variables that contain “mm”.

penguins %>%
  summarize(across(contains("mm"), mean, na.rm=T))
## Warning: There was 1 warning in `summarize()`.
## ℹ In argument: `across(contains("mm"), mean, na.rm = T)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 1 × 3
##   bill_length_mm bill_depth_mm flipper_length_mm
##            <dbl>         <dbl>             <dbl>
## 1           43.9          17.2              201.

group_by also works.

penguins %>%
  group_by(sex) %>% 
  summarize(across(contains("mm"), mean, na.rm=T))
## # A tibble: 3 × 4
##   sex    bill_length_mm bill_depth_mm flipper_length_mm
##   <fct>           <dbl>         <dbl>             <dbl>
## 1 female           42.1          16.4              197.
## 2 male             45.9          17.9              205.
## 3 <NA>             41.3          16.6              199

Here we summarize across all variables.

penguins %>%
  summarise_all(n_distinct)
## # A tibble: 1 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##     <int>  <int>          <int>         <int>             <int>       <int>
## 1       3      3            165            81                56          95
## # ℹ 2 more variables: sex <int>, year <int>

Operators can also work, here I am summarizing n_distinct() across all variables except species, island, and sex.

penguins %>%
  summarize(across(!c(species, island, sex, year), 
                   mean, na.rm=T))
## # A tibble: 1 × 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           43.9          17.2              201.       4202.

All variables that include “bill”…all of the other dplyr operators also work.

penguins %>%
  summarise(across(starts_with("bill"), n_distinct))
## # A tibble: 1 × 2
##   bill_length_mm bill_depth_mm
##            <int>         <int>
## 1            165            81

Lab 8.2-Fun with NA’s

Load the libraries

library("tidyverse")
library("naniar")
library("skimr")
library("janitor")

Load the mammals life history data and clean the names

life_history <- read_csv("data_midterm2/mammal_lifehistories_v3.csv") %>% clean_names()
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Are there any NA’s?

glimpse(life_history)
## Rows: 1,440
## Columns: 13
## $ order        <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family       <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus        <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species      <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass         <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation    <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn      <chr> "3246.36", "5480", "5093", "10166.67", "not measured", "3…
## $ weaning      <dbl> 3.00, 6.50, 5.63, 6.50, -999.00, 4.00, 4.04, 2.13, 10.71,…
## $ wean_mass    <dbl> 8900, -999, 15900, -999, -999, -999, -999, -999, 157500, …
## $ afr          <dbl> 13.53, 27.27, 16.66, 23.02, -999.00, 14.89, 10.23, 20.13,…
## $ max_life     <dbl> 142, 308, 213, 240, 0, 251, 228, 255, 300, 324, 300, 314,…
## $ litter_size  <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, NA, NA, 2.00, NA, 1.89, 1.00, 1.00, 0.7…
summary(life_history)
##     order              family             genus             species         
##  Length:1440        Length:1440        Length:1440        Length:1440       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       mass             gestation         newborn             weaning       
##  Min.   :     -999   Min.   :-999.00   Length:1440        Min.   :-999.00  
##  1st Qu.:       50   1st Qu.:-999.00   Class :character   1st Qu.:-999.00  
##  Median :      403   Median :   1.05   Mode  :character   Median :   0.73  
##  Mean   :   383577   Mean   :-287.25                      Mean   :-427.17  
##  3rd Qu.:     7009   3rd Qu.:   4.50                      3rd Qu.:   2.00  
##  Max.   :149000000   Max.   :  21.46                      Max.   :  48.00  
##                                                                            
##    wean_mass             afr             max_life        litter_size      
##  Min.   :    -999   Min.   :-999.00   Min.   :   0.00   Min.   :-999.000  
##  1st Qu.:    -999   1st Qu.:-999.00   1st Qu.:   0.00   1st Qu.:   1.000  
##  Median :    -999   Median :   2.50   Median :   0.00   Median :   2.270  
##  Mean   :   16049   Mean   :-408.12   Mean   :  93.19   Mean   : -55.634  
##  3rd Qu.:      10   3rd Qu.:  15.61   3rd Qu.: 147.25   3rd Qu.:   3.835  
##  Max.   :19075000   Max.   : 210.00   Max.   :1368.00   Max.   :  14.180  
##                                                                           
##   litters_year  
##  Min.   :0.140  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.636  
##  3rd Qu.:2.000  
##  Max.   :7.500  
##  NA's   :689

New from the purrr package. Gives quick summary of number of NA’s in each variable

life_history %>% 
  map_df(~ sum(is.na(.))) # It's telling you theres no NA because it read -999 as a number not NA
## # A tibble: 1 × 13
##   order family genus species  mass gestation newborn weaning wean_mass   afr
##   <int>  <int> <int>   <int> <int>     <int>   <int>   <int>     <int> <int>
## 1     0      0     0       0     0         0       0       0         0     0
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>

naniar: package built to manage NA’s

A single approach to deal with NA’s in this data set

life_history_no_nas <- 
  read_csv(file = "data_midterm2/mammal_lifehistories_v3.csv", na = c("NA", " ", ".", "-999")) %>% clean_names() # notice that I am creating a new object/ dataframe that doesn't have any NA's
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

miss_var_summary provides a clean summary of NA’s across the data frame.

naniar::miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <dbl>
##  1 wean_mass      1039    72.2 
##  2 litters_year    689    47.8 
##  3 weaning         619    43.0 
##  4 afr             607    42.2 
##  5 gestation       418    29.0 
##  6 mass             85     5.90
##  7 litter_size      84     5.83
##  8 order             0     0   
##  9 family            0     0   
## 10 genus             0     0   
## 11 species           0     0   
## 12 newborn           0     0   
## 13 max_life          0     0

Notice that max_life has no NA’s. Does that make sense in the context of this data? (no)

hist(life_history_no_nas$max_life) # we found another way NA's are represented 

Let’s use mutate() and na_if() to replace 0’s with NA’s in max_life.

life_history_no_nas <- 
  life_history_no_nas %>% 
  mutate(max_life=na_if(max_life, 0))
miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <dbl>
##  1 wean_mass      1039    72.2 
##  2 max_life        841    58.4 
##  3 litters_year    689    47.8 
##  4 weaning         619    43.0 
##  5 afr             607    42.2 
##  6 gestation       418    29.0 
##  7 mass             85     5.90
##  8 litter_size      84     5.83
##  9 order             0     0   
## 10 family            0     0   
## 11 genus             0     0   
## 12 species           0     0   
## 13 newborn           0     0

We can also use miss_var_summary with group_by(). This helps us better evaluate where NA’s are in the data.

life_history_no_nas %>%
  group_by(order) %>%
  select(order) %>% 
  miss_var_summary(order=T)
## # A tibble: 0 × 3
## # Groups:   order [0]
## # ℹ 3 variables: order <chr>, n_miss <int>, pct_miss <dbl>

naniar also has a nice replace function which will allow you to precisely control which values you want replaced with NA’s in each variable.

life_history %>% # going back to original data
  replace_with_na(replace = list(newborn = "not measured", 
                                 weaning= -999, 
                                 wean_mass= -999, 
                                 afr= -999, 
                                 max_life= 0, 
                                 litter_size= -999, 
                                 gestation= -999, 
                                 mass= -999)) %>% 
miss_var_summary()
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <dbl>
##  1 wean_mass      1039    72.2 
##  2 max_life        841    58.4 
##  3 litters_year    689    47.8 
##  4 weaning         619    43.0 
##  5 afr             607    42.2 
##  6 newborn         595    41.3 
##  7 gestation       418    29.0 
##  8 mass             85     5.90
##  9 litter_size      84     5.83
## 10 order             0     0   
## 11 family            0     0   
## 12 genus             0     0   
## 13 species           0     0
# Makes replacement of NA's very specific

You can also use naniar to replace a specific value (like -999) with NA across the entire data set.

life_history %>% #going back to the original data
  replace_with_na_all(condition = ~.x == -999)%>% 
miss_var_summary()
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <dbl>
##  1 wean_mass      1039    72.2 
##  2 litters_year    689    47.8 
##  3 weaning         619    43.0 
##  4 afr             607    42.2 
##  5 gestation       418    29.0 
##  6 mass             85     5.90
##  7 litter_size      84     5.83
##  8 order             0     0   
##  9 family            0     0   
## 10 genus             0     0   
## 11 species           0     0   
## 12 newborn           0     0   
## 13 max_life          0     0

Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace NA’s across the entire data set.

common_na_strings
##  [1] "missing" "NA"      "N A"     "N/A"     "#N/A"    "NA "     " NA"    
##  [8] "N /A"    "N / A"   " N / A"  "N / A "  "na"      "n a"     "n/a"    
## [15] "na "     " na"     "n /a"    "n / a"   " a / a"  "n / a "  "NULL"   
## [22] "null"    ""        "\\?"     "\\*"     "\\."
common_na_numbers
## [1]    -9   -99  -999 -9999  9999    66    77    88
life_history %>% #going back to the original data
  replace_with_na_all(condition = ~.x %in% c(common_na_strings, common_na_numbers)) %>% 
  mutate(newborn=na_if(newborn, "not measured"))
## # A tibble: 1,440 × 13
##    order   family genus species   mass gestation newborn weaning wean_mass   afr
##    <chr>   <chr>  <chr> <chr>    <dbl>     <dbl> <chr>     <dbl>     <dbl> <dbl>
##  1 Artiod… Antil… Anti… americ… 4.54e4      8.13 3246.36    3         8900  13.5
##  2 Artiod… Bovid… Addax nasoma… 1.82e5      9.39 5480       6.5         NA  27.3
##  3 Artiod… Bovid… Aepy… melamp… 4.15e4      6.35 5093       5.63     15900  16.7
##  4 Artiod… Bovid… Alce… busela… 1.5 e5      7.9  10166.…    6.5         NA  23.0
##  5 Artiod… Bovid… Ammo… clarkei 2.85e4      6.8  <NA>      NA           NA  NA  
##  6 Artiod… Bovid… Ammo… lervia  5.55e4      5.08 3810       4           NA  14.9
##  7 Artiod… Bovid… Anti… marsup… 3   e4      5.72 3910       4.04        NA  10.2
##  8 Artiod… Bovid… Anti… cervic… 3.75e4      5.5  3846       2.13        NA  20.1
##  9 Artiod… Bovid… Bison bison   4.98e5      8.93 20000     10.7     157500  29.4
## 10 Artiod… Bovid… Bison bonasus 5   e5      9.14 23000.…    6.6         NA  30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>

Visualizing NAs

There is another package visdat that can be used to visualize the proportion of different classes of data, including missing data. But, it is limited by size.

library(visdat)
vis_dat(life_history) #classes of data

vis_miss(life_history)

Dealing with NA’s in advance

If you are sure that you know how NA’s are treated in the data, then you can deal with them in advance using na() as part of the readr package.

life_history_advance <- 
  readr::read_csv(file = "data_midterm2/mammal_lifehistories_v3.csv", 
                  na = c("NA", " ", ".", "-999")) #all NA, blank spaces, .,and -999 are treated as NA
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
miss_var_summary(life_history_advance)
## # A tibble: 13 × 3
##    variable     n_miss pct_miss
##    <chr>         <int>    <dbl>
##  1 wean mass      1039    72.2 
##  2 litters/year    689    47.8 
##  3 weaning         619    43.0 
##  4 AFR             607    42.2 
##  5 gestation       418    29.0 
##  6 mass             85     5.90
##  7 litter size      84     5.83
##  8 order             0     0   
##  9 family            0     0   
## 10 Genus             0     0   
## 11 species           0     0   
## 12 newborn           0     0   
## 13 max. life         0     0

Lab 9.1-Tidyr 1: Tidy data and pivot_long()

Load the tidyverse

library(tidyverse)

Tidy data

Tidy data follows three conventions:
(1) each variable has its own column
(2) each observation has its own row
(3) each value has its own cell

pivot_longer(): shifts data from wide (data entry in spreadsheets, column names may represent values of variable) to long format.

Rules:
+ pivot_longer(cols, names_to, values_to) + cols - Columns to pivot to longer format + names_to - Name of the new column; it will contain the column names of gathered columns as values + values_to - Name of the new column; it will contain the data stored in the values of gathered columns

Example 1: column names are data

The following data show results from a drug trial with four treatments on six patients. The values represent resting heart rate.

heartrate <- read_csv("data_midterm2/heartrate.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate
## # A tibble: 6 × 5
##   patient      a     b     c     d
##   <chr>    <dbl> <dbl> <dbl> <dbl>
## 1 Margaret    72    74    80    68
## 2 Frank       84    84    88    76
## 3 Hawkeye     64    66    68    64
## 4 Trapper     60    58    64    58
## 5 Radar       74    72    78    70
## 6 Henry       88    87    88    72

To fix untidy data, we need to reshape the table to long format while keeping track of column names and values. We do this using pivot_longer(). Notice that the dimensions of the data frame change.

heartrate %>% 
  pivot_longer(-patient, #patient will not move
               names_to = "drug", #make a new column called "drug"
               values_to="heartrate" #values moved to a new column called "heartrate"
               )
## # A tibble: 24 × 3
##    patient  drug  heartrate
##    <chr>    <chr>     <dbl>
##  1 Margaret a            72
##  2 Margaret b            74
##  3 Margaret c            80
##  4 Margaret d            68
##  5 Frank    a            84
##  6 Frank    b            84
##  7 Frank    c            88
##  8 Frank    d            76
##  9 Hawkeye  a            64
## 10 Hawkeye  b            66
## # ℹ 14 more rows

Example 2: some column names are data

Some (but not all) of the column names are data. We also have NA’s.

billboard <- read_csv("data_midterm2/billboard.csv")
## Rows: 317 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): artist, track
## dbl  (65): wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10, wk11, wk12, wk...
## lgl  (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, wk75, wk76
## date  (1): date.entered
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
billboard
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

Solution 1: specify a range of columns that you want to pivot.

billboard2 <- 
  billboard %>% 
  pivot_longer(wk1:wk76, # a range of columns
               names_to = "week",
               values_to = "rank", 
               values_drop_na = TRUE #this will drop the NA's
               )
billboard2
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # ℹ 5,297 more rows

Solution 2: OR, specify columns that you want to stay fixed.

billboard3 <- 
  billboard %>% 
  pivot_longer(-c(artist, track, date.entered), #specific columns to stay fixed (by using c, meaning dont move these)
               names_to = "week",
               values_to = "rank",
               values_drop_na = TRUE # this will drop the NA's 
               )
billboard3
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # ℹ 5,297 more rows

Solution 3: identify columns by a prefix, remove the prefix and all NA’s.

billboard %>% 
   pivot_longer(
   cols = starts_with("wk"), #columns that start with "wk"
   names_to = "week",
   names_prefix = "wk",
   values_to = "rank",
   values_drop_na = TRUE)
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
## # ℹ 5,297 more rows

Example 3: more than one variable in a column name

In this case, there are two observations in each column name.

qpcr_untidy <- read_csv("data_midterm2/qpcr_untidy.csv")
## Rows: 5 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gene
## dbl (9): exp1_rep1, exp1_rep2, exp1_rep3, exp2_rep1, exp2_rep2, exp2_rep3, e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qpcr_untidy
## # A tibble: 5 × 10
##   gene  exp1_rep1 exp1_rep2 exp1_rep3 exp2_rep1 exp2_rep2 exp2_rep3 exp3_rep1
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 A          21.7      19.8      20.7      18.3      20.4      17.6      20.6
## 2 B          24.3      24.8      25.2      26.0      29.9      26.4      25.4
## 3 C          20.7      21.5      21.3      25.5      18.7      22.3      21.9
## 4 D          26.9      28.0      27.7      33.1      24.3      28.9      28.5
## 5 E          25.0      22.7      23.8      21.1      23.4      20.2      23.7
## # ℹ 2 more variables: exp3_rep2 <dbl>, exp3_rep3 <dbl>

names_sep helps pull these apart, but we still have “exp” and “rep” to deal with.

qpcr_untidy %>% 
  pivot_longer(
    exp1_rep1:exp3_rep3,
    names_to= c("experiment", "replicate"),
    names_sep="_", #names seperate by underscore
    values_to="mRNA_expression"
  )
## # A tibble: 45 × 4
##    gene  experiment replicate mRNA_expression
##    <chr> <chr>      <chr>               <dbl>
##  1 A     exp1       rep1                 21.7
##  2 A     exp1       rep2                 19.8
##  3 A     exp1       rep3                 20.7
##  4 A     exp2       rep1                 18.3
##  5 A     exp2       rep2                 20.4
##  6 A     exp2       rep3                 17.6
##  7 A     exp3       rep1                 20.6
##  8 A     exp3       rep2                 19.9
##  9 A     exp3       rep3                 19.2
## 10 B     exp1       rep1                 24.3
## # ℹ 35 more rows

Lab 9.2-Tidyr 2: pivot_wider()

pivot_longer()

Recall that we use pivot_longer() when our column names actually represent variables. A classic example would be that the column names represent observations of a variable.

datasets::USPersonalExpenditure
##                       1940   1945  1950 1955  1960
## Food and Tobacco    22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health   3.530  5.760  9.71 14.0 21.10
## Personal Care        1.040  1.980  2.45  3.4  5.40
## Private Education    0.341  0.974  1.80  2.6  3.64
?USPersonalExpenditure

Here we add a new column of expenditure types, which are stored as rownames above, with mutate(). The USPersonalExpenditures data also needs to be converted to a data frame before we can use the tidyverse functions, because it comes as a matrix.

expenditures <- USPersonalExpenditure %>% 
  as_tibble() %>% #this transforms the matrix into a data frame
  mutate(expenditure = rownames(USPersonalExpenditure))
expenditures
## # A tibble: 5 × 6
##   `1940` `1945` `1950` `1955` `1960` expenditure        
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>              
## 1 22.2   44.5    59.6    73.2  86.8  Food and Tobacco   
## 2 10.5   15.5    29      36.5  46.2  Household Operation
## 3  3.53   5.76    9.71   14    21.1  Medical and Health 
## 4  1.04   1.98    2.45    3.4   5.4  Personal Care      
## 5  0.341  0.974   1.8     2.6   3.64 Private Education

separate()

In this new heart rate example, we have the sex of each patient included with their name. Are these data tidy? No, there is more than one value per cell in the patient column and the columns a, b, c, d once again represent values.

heartrate2 <- read_csv("data_midterm2/heartrate2.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate2
## # A tibble: 6 × 5
##   patient        a     b     c     d
##   <chr>      <dbl> <dbl> <dbl> <dbl>
## 1 Margaret_f    72    74    80    68
## 2 Frank_m       84    84    88    76
## 3 Hawkeye_m     64    66    68    64
## 4 Trapper_m     60    58    64    58
## 5 Radar_m       74    72    78    70
## 6 Henry_m       88    87    88    72

We need to start by separating the patient names from their sexes. separate() needs to know which column you want to split, the names of the new columns, and what to look for in terms of breaks in the data.

heartrate2 %>% 
  separate(patient, into= c("patient", "sex"), sep = "_")
## # A tibble: 6 × 6
##   patient  sex       a     b     c     d
##   <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret f        72    74    80    68
## 2 Frank    m        84    84    88    76
## 3 Hawkeye  m        64    66    68    64
## 4 Trapper  m        60    58    64    58
## 5 Radar    m        74    72    78    70
## 6 Henry    m        88    87    88    72

Re-examine heartrate2. Use separate() for the sexes, pivot_longer() to tidy, and arrange() to organize by patient and drug. Store this as a new object heartrate3.

heartrate3 <- heartrate2 %>% 
  separate(patient, into=c("patient", "sex"), sep="_") %>% 
  pivot_longer(-c(patient, sex),
               names_to = "drug",
               values_to = "heartrate")
heartrate3
## # A tibble: 24 × 4
##    patient  sex   drug  heartrate
##    <chr>    <chr> <chr>     <dbl>
##  1 Margaret f     a            72
##  2 Margaret f     b            74
##  3 Margaret f     c            80
##  4 Margaret f     d            68
##  5 Frank    m     a            84
##  6 Frank    m     b            84
##  7 Frank    m     c            88
##  8 Frank    m     d            76
##  9 Hawkeye  m     a            64
## 10 Hawkeye  m     b            66
## # ℹ 14 more rows

unite() is the opposite of separate(). Its syntax is straightforward. You only need to give a new column name and then list the columns to combine with a separation character. Give it a try below by recombining patient and sex from heartrate3.

heartrate3 %>% 
  unite(patient_sex, "patient", "sex", sep=" ") #puts sexes back in name
## # A tibble: 24 × 3
##    patient_sex drug  heartrate
##    <chr>       <chr>     <dbl>
##  1 Margaret f  a            72
##  2 Margaret f  b            74
##  3 Margaret f  c            80
##  4 Margaret f  d            68
##  5 Frank m     a            84
##  6 Frank m     b            84
##  7 Frank m     c            88
##  8 Frank m     d            76
##  9 Hawkeye m   a            64
## 10 Hawkeye m   b            66
## # ℹ 14 more rows

pivot_wider()

The opposite of pivot_longer(). You use pivot_wider() when you have an observation scattered across multiple rows. In the example below, cases and population represent variable names not observations.

Rules:
+ pivot_wider(names_from, values_from)
+ names_from - Values in the names_from column will become new column names
+ values_from - Cell values will be taken from the values_from column

tb_data <- read_csv("data_midterm2/tb_data.csv")
## Rows: 12 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, key
## dbl (2): year, value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb_data
## # A tibble: 12 × 4
##    country      year key             value
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

When using pivot_wider() we use names_from to identify the variables (new column names) and values_from to identify the values associated with the new columns.

tb_data_wide <- tb_data %>% 
  pivot_wider(names_from = "key", #the observations under key will become new columns
              values_from = "value") #the values under value will be moved to the new columns
tb_data_wide
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

pivot_longer to transform them back to long!

tb_data_long <- tb_data_wide %>% 
  pivot_longer(c(-country, - year),
               names_to = "key",
               values_to = "value")
tb_data_long
## # A tibble: 12 × 4
##    country      year key             value
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Lab 10.1-Data Visualization: ggplot part 1

Data Types

We first need to define some of the data types we will use to build plots.

  • discrete quantitative data that only contains integers
  • continuous quantitative data that can take any numerical value
  • categorical qualitative data that can take on a limited number of values

Basics

The syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers.

In short, plot= data + geom_ + aesthetics.

Example

To make things easy, let’s start with some built in data.

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

To make a plot, we need to first specify the data and map the aesthetics. The aesthetics include how each variable in our data set will be used. In the example below, I am using the aes() function to identify the x and y variables in the plot.

ggplot(data=iris, #specify the data
       mapping=aes(x=Species, y=Petal.Length)) #map the aesthetics

# species is a factor on x-axis

Here we specify that we want a boxplot, indicated by geom_boxplot().

ggplot(data=iris, #specify the data
       mapping=aes(x=Species, y=Petal.Length))+ #map the aesthetics
  geom_boxplot() #add the plot type

Scatterplots and barplots

Now that we have a general idea of the syntax, let’s start by working with two common plots: 1) scatter plots and 2) bar plots.

homerange <- read_csv("data_midterm2/Tamburelloetal_HomeRangeDatabase.csv")
## Rows: 569 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): taxon, common.name, class, order, family, genus, species, primarym...
## dbl  (8): mean.mass.g, log10.mass, mean.hra.m2, log10.hra, dimension, preyma...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Scatter Plots

Scatter plots are good at revealing relationships that are not readily visible in the raw data. For now, we will not add regression aka. “best of fit” lines or calculate any r2 values.

In the case below, we are exploring whether or not there is a relationship between animal mass and home range. We are using the log transformed values because there is a large difference in mass and home range among the different species in the data.

names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"
ggplot(data=homerange, #specify the data
       mapping=aes(x=log10.mass, y=log10.hra))+ #map the aesthetics
  geom_point() #add the plot type

In big data sets with lots of overlapping values, over-plotting can be an issue. geom_jitter() is similar to geom_point() but it helps with over plotting by adding some random noise to the data and separating some of the individual points.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
  geom_jitter() #adds some random noise to separate the points a little

To add a regression (best of fit) line, we just add another layer.

ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
  geom_point()+
  geom_smooth(method=lm, se=T) #add a regression line (linear, standard error)
## `geom_smooth()` using formula = 'y ~ x'

Bar Plot: geom_bar()

The simplest type of bar plot counts the number of observations in a categorical variable. In this case, we want to know how many observations are present in the variable trophic.guild. Notice that we do not specify a y-axis because it is count by default.

names(homerange)
##  [1] "taxon"                      "common.name"               
##  [3] "class"                      "order"                     
##  [5] "family"                     "genus"                     
##  [7] "species"                    "primarymethod"             
##  [9] "N"                          "mean.mass.g"               
## [11] "log10.mass"                 "alternative.mass.reference"
## [13] "mean.hra.m2"                "log10.hra"                 
## [15] "hra.reference"              "realm"                     
## [17] "thermoregulation"           "locomotion"                
## [19] "trophic.guild"              "dimension"                 
## [21] "preymass"                   "log10.preymass"            
## [23] "PPMR"                       "prey.size.reference"
homerange %>% 
  count(trophic.guild)
## # A tibble: 2 × 2
##   trophic.guild     n
##   <chr>         <int>
## 1 carnivore       342
## 2 herbivore       227

Also notice that we can use pipes! The mapping= function is implied by aes and so is often left out.

homerange %>% 
  ggplot(aes(x=trophic.guild)) + #only works with categorical data 
  geom_bar() #good for counts

Bar Plot: geom_col()

Unlike geom_bar(), geom_col() allows us to specify an x-axis and a y-axis.

homerange %>% 
  filter(family=="salmonidae") %>%
  select(common.name, log10.mass) %>% 
  ggplot(aes(y=common.name, x=log10.mass))+ #notice the switch in x and y
  geom_col()+
  coord_flip() #flips the graph

geom_bar() with stat="identity" stat="identity" allows us to map a variable to the y-axis so that we aren’t restricted to counts.

homerange %>% 
  filter(family=="salmonidae") %>% 
  ggplot(aes(x=common.name, y=log10.mass))+
  geom_bar(stat="identity") # need to specify this

Lab 10.2-Data Visualization: ggplot part 1

Box Plots

For the next series of examples, we will use the homerange data. Database of vertebrate home range sizes.

Boxplots help us visualize a range of values. So, on the x-axis we typically have something categorical and the y-axis is the range. In the case below, we are plotting log10.mass by taxonomic class in the homerange data. geom_boxplot() is the geom type for a standard box plot. The center line in each box represents the median, not the mean.

Let’s look at the variable log10.mass grouped by taxonomic class.

homerange %>% 
  group_by(class) %>% 
  summarize(min_log10.mass=min(log10.mass),
            max_log10.mass=max(log10.mass),
            median_log10.mass=median(log10.mass))
## # A tibble: 4 × 4
##   class          min_log10.mass max_log10.mass median_log10.mass
##   <chr>                   <dbl>          <dbl>             <dbl>
## 1 actinopterygii         -0.658           3.55              2.08
## 2 aves                    0.712           4.95              1.82
## 3 mammalia                0.620           6.60              3.33
## 4 reptilia                0.539           4.03              2.51
homerange %>% 
  ggplot(aes(x = class, y = log10.mass)) +
  geom_boxplot()

Lab 11.1-Data Visualization: ggplot part 2

Load the data

Let’s revisit the mammal life history data to practice our ggplot skills. The data are from: S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.

life_history <- read_csv("data_midterm2/mammal_lifehistories_v2.csv", na="-999") %>% clean_names()

Bar Plots

Bar plots count the number of observations in a categorical variable. What is the difference between geom_bar and geom_col? Make two bar plots showing the number of observations for each order using each geom type.

names(life_history)
##  [1] "order"        "family"       "genus"        "species"      "mass"        
##  [6] "gestation"    "newborn"      "weaning"      "wean_mass"    "afr"         
## [11] "max_life"     "litter_size"  "litters_year"

geom_col

life_history %>% 
  count(order, sort=T) %>%  # This is the same as arrange
  ggplot(aes(x=order, y=n))+
  geom_col()+
  coord_flip()

geom_bar

life_history %>% 
  ggplot(aes(x=order))+
  geom_bar()+
  coord_flip()

Remember that ggplot build plots in layers. These layers can significantly improve the appearance of the plot. What if we wanted a bar plot of the mean mass for each order? Would we use geom_bar or geom_col? (geom_col)

life_history %>% 
  group_by(order) %>% 
  summarize(mean_mass = mean(mass, na.rm=T)) %>% 
  ggplot(aes(x=order, y=mean_mass))+
  geom_col()+
  coord_flip()

There are a few problems here. First, the y-axis is in scientific notation. We can fix this by adjusting the options for the session.

options(scipen=999)#cancels scientific notation for the session

Next, the y-axis is not on a log scale. We can fix this by adding scale_y_log10().

life_history %>% 
  group_by(order) %>% 
  summarize(mean_mass = mean(mass, na.rm=T)) %>% 
  ggplot(aes(x=order, y=mean_mass))+
  geom_col()+
  coord_flip()+
  scale_y_log10()

Lastly, we can adjust the x-axis labels to make them more readable. We do this using reorder.

life_history %>% 
  group_by(order) %>% 
  summarize(mean_mass = mean(mass, na.rm=T)) %>% 
  ggplot(aes(x=reorder(order, mean_mass), y=mean_mass))+ # This allows us to order the x-axis by mean_mass
  geom_col()+
  coord_flip()+
  scale_y_log10()

Scatterplots

Scatter plots allow for comparisons of two continuous variables. Make a scatterplot below that compares gestation time and weaning mass.

life_history %>% 
  ggplot(aes(x=gestation, y=wean_mass))+
  geom_jitter(na.rm = T)+ # prevents overplotting
  scale_y_log10()

Boxplots

Box plots help us visualize a range of values. So, on the x-axis we typically have something categorical and the y-axis is the range. Let’s make a box plot that compares mass across taxonomic orders.

life_history %>% 
  ggplot(aes(x=order, y=mass))+
  geom_boxplot(na.rm = T)+
  coord_flip()+ 
  scale_y_log10()

Aesthetics: Labels

Now that we have practiced scatter plots, bar plots, and box plots we need to learn how to adjust their appearance to suit our needs. Let’s start with labeling x and y axes.

elephants <- read_csv("data_midterm2/elephantsMF.csv") %>% clean_names()
## Rows: 288 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (2): Age, Height
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Make a plot that compares age and height of elephants.

elephants %>% 
  ggplot(aes(x=age, y=height))+
  geom_point()+
  geom_smooth(method = lm, sex = F)
## Warning in geom_smooth(method = lm, sex = F): Ignoring unknown parameters:
## `sex`
## `geom_smooth()` using formula = 'y ~ x'

The plot looks clean, but it is incomplete. A reader unfamiliar with the data might have a difficult time interpreting the labels. To add custom labels, we use the labs command.

elephants %>% 
  ggplot(aes(x=age, y=height)) + 
  geom_point() + 
  geom_smooth(method=lm, se=F)+
  labs(title="Elephant Age vs. Height", # adds a title
       x="Age (years)",
       y="Height (cm)")
## `geom_smooth()` using formula = 'y ~ x'

We can improve the plot further by adjusting the size and face of the text. We do this using theme(). The rel() option changes the relative size of the title to keep things consistent. Adding hjust allows control of title position.

elephants %>% 
  ggplot(aes(x=age, y=height)) + 
  geom_point() + 
  geom_smooth(method=lm, se=F)+ # se = standard error along with line
  labs(title="Elephant Age vs. Height", # adds a title
       x="Age (years)",
       y="Height (cm)")+
  theme(plot.title=element_text(size=rel(1.5), hjust = 0.5)) # rel = title relative to the plot, hjust = horizontal (left = 0, right = 1)
## `geom_smooth()` using formula = 'y ~ x'

Other Aesthetics

There are lots of options for aesthetics. An aesthetic can be assigned to either numeric or categorical data. fill is a common grouping option; notice that an appropriate key is displayed when you use one of these options.

elephants %>% 
  ggplot(aes(x=sex, fill=sex))+ #fill is a grouping option
  geom_bar()

size adjusts the size of points relative to a continuous variable.

life_history %>% 
  ggplot(aes(x=gestation, y=log10(mass), size=mass))+ # Made size relative to mass of thing plotted
  geom_point(na.rm=T)

Lab 11.2-Data Visualization: ggplot part 2

homerange <- 
  read_csv("data_midterm2/Tamburelloetal_HomeRangeDatabase.csv", na = c("", "NA", "\\"))
## Rows: 569 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): taxon, common.name, class, order, family, genus, species, primarym...
## dbl  (8): mean.mass.g, log10.mass, mean.hra.m2, log10.hra, dimension, preyma...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A few more useful aesthetics

There are many options to create nice plots in ggplot. One useful trick is to store the plot as a new object and then experiment with geom’s and aesthetics. Let’s setup a plot that compares log10.mass and log10.hra. Notice that we are not specifying a geom.

p <- homerange %>% 
  ggplot(aes(x= log10.mass, y= log10.hra)) # store base plot as object

Play with point size by adjusting the size argument.

p + geom_point(size=1.25)

We can color the points by a categorical variable.

p + geom_point(aes(color=thermoregulation), size=1.5) # For scatterplots, "color" is the same as "fill" in barplots

We can also map shapes to another categorical variable.

p + geom_point(aes(color=thermoregulation, shape=thermoregulation), size=1.5)

Barplots and multiple variables

At this point you should be comfortable building bar plots that show counts of observations using geom_bar(). Last time we explored the fill option as a way to bring color to the plot; i.e. we filled by the same variable that we were plotting. What happens when we fill by a different categorical variable?
Let’s start by counting how many observations we have in each taxonomic group.

homerange %>% count(taxon, sort=T)
## # A tibble: 9 × 2
##   taxon             n
##   <chr>         <int>
## 1 mammals         238
## 2 birds           140
## 3 marine fishes    90
## 4 snakes           41
## 5 river fishes     14
## 6 turtles          14
## 7 tortoises        12
## 8 lizards          11
## 9 lake fishes       9

Now let’s make a bar plot of these data.

homerange %>% 
  ggplot(aes(x=taxon))+
  geom_bar()+
  coord_flip()+
  labs(title="Observations by Taxon",
       x="Taxonomic Group")

By specifying fill=trophic.guild we build a stacked bar plot that shows the proportion of a given taxonomic group that is an herbivore or carnivore.

homerange %>% 
  ggplot(aes(x=taxon, fill=trophic.guild))+
  geom_bar()+
  coord_flip()+
  labs(title="Observations by Taxon",
       x="Taxonomic Group")

We can also have counts of each trophic guild within taxonomic group shown side-by-side by specifying position="dodge".

homerange %>% 
  ggplot(aes(x=taxon, fill=trophic.guild))+
  geom_bar(position="dodge")+
  coord_flip()+
  labs(title="Observations by Taxon",
       x="Taxonomic Group")

Here is the same plot oriented vertically.

homerange %>% 
  ggplot(aes(x=taxon, fill=trophic.guild))+
  geom_bar(position="dodge")+
  theme(axis.text.x=element_text(angle=60, hjust = 1))+ #change angle of x-axis observations
  labs(title="Observations by Taxon",
       x="Taxonomic Group")

We can also scale all bars to a percentage.

homerange %>% 
  ggplot(aes(x = taxon, fill = trophic.guild))+
  geom_bar(position = position_fill())+ 
  scale_y_continuous(labels = scales::percent)+
  coord_flip()

Using group

In addition to fill, group is an aesthetic that accomplishes the same function but does not add color.

Here is a box plot that shows log10.mass by taxonomic class.

homerange %>% 
  ggplot(aes(x = class, y = log10.mass)) +
  geom_boxplot()

I use group to make individual box plots for each taxon within class.

homerange %>% 
  ggplot(aes(x = class, y = log10.mass, group = taxon)) +
  geom_boxplot()

I can also use fill to associate the different taxa with a color coded key.

homerange %>% 
  ggplot(aes(x = class, y = log10.mass, fill = taxon)) +
  geom_boxplot(alpha=0.5)

Lab 12.1-Data Visualization: ggplot part 3

Load the libraries

library(tidyverse)
library(RColorBrewer)
library(paletteer)
library(janitor)
options(scipen=999) #cancels the use of scientific notation for the session

Data

For this tutorial, we will use two data sets: deserts and homerange

deserts <- read_csv("data_midterm2/surveys_complete.csv")

Line plots

Line plots are great when you need to show changes over time. Here we look at the number of samples for species DM and DS over the years represented in the deserts data. This takes some careful thought- we want to know how sampling has changed over time for these two species.

Let’s start by making a clear x and y so we know what we are going to plot.

deserts %>% 
  filter(species_id=="DM" | species_id == "DS") %>% 
  mutate(year=as.factor(year)) %>%  # Change to factor bc R would think you're doing a numeric (addition etc.) but you dont add years
  group_by(year, species_id) %>% 
  summarize(n=n(), .groups = 'keep') %>% 
  pivot_wider(names_from = species_id, values_from = n)
## # A tibble: 26 × 3
## # Groups:   year [26]
##    year     DM    DS
##    <fct> <int> <int>
##  1 1977    264    98
##  2 1978    389   320
##  3 1979    209   204
##  4 1980    493   346
##  5 1981    559   354
##  6 1982    609   354
##  7 1983    528   280
##  8 1984    396    76
##  9 1985    667    98
## 10 1986    406    88
## # ℹ 16 more rows
deserts %>% 
  filter(species_id=="DM" | species_id == "DS") %>% 
  mutate(year=as.factor(year)) %>% # If you remove this gaps would be big and treated as numeric
  group_by(year, species_id) %>% 
  summarize(n=n(), .groups = 'keep') %>% 
  ggplot(aes(x=year, y=n, group=species_id, color = species_id))+
  geom_line()+
  geom_point(shape=2)+ # You can experiment with shapes
  theme(axis.text.x=element_text(angle=60, hjust=1))+
  labs(title="Number of Samples for Species DM & DS",
       x = "Year",
       y = "n")

Histograms

Histograms are frequently used by biologists; they show the distribution of continuous variables. As students, you have seen histograms of grade distributions. A histogram bins the data and you specify the number of bins that encompass a range of observations. For something like grades, this is easy because the number of bins corresponds to the grades A-F. By default, R uses a formula to calculate the number of bins but some adjustment may be required.

What does the distribution of body mass look like in the homerange data?

homerange %>% 
  ggplot(aes(x = log10.mass)) +
  geom_histogram(bins = 30)+ #we can adjust the number of bins with the bins argument
  labs(title = "Distribution of Body Mass")

Let’s play with the colors. This shows all 657 of R’s built-in colors. Notice that color and fill do different things!

#grDevices::colors()

Let’s rebuild the histogram, but this time we will specify the color and fill. Do a little experimentation on your own with the different colors.

homerange %>% 
  ggplot(aes(x = log10.mass)) +
  geom_histogram(color = "black", fill = "powderblue", bins=10)+
  labs(title = "Distribution of Body Mass")

Density plots

Density plots are similar to histograms but they use a smoothing function to make the distribution more even and clean looking. They do not use bins.

homerange %>% 
  ggplot(aes(x = log10.mass)) +
  geom_density(fill="deepskyblue4", alpha  =0.4, color = "black")+ #alpha is the transparency
  labs(title = "Distribution of Body Mass")

I like to see both the histogram and the density curve so I often plot them together. Note that I assign the density plot a different color.

homerange %>% 
  ggplot(aes(x=log10.mass)) +
  geom_histogram(aes(y = after_stat(density)), fill = "deepskyblue4", alpha = 0.4, color = "black")+
  geom_density(color = "red")+
  labs(title = "Distribution of Body Mass")

Create Categories with mutate and case_when()

case_when() is a very handy function from dplyr which allows us to calculate a new variable from other variables. We use case_when() within mutate() to do this.case_when() allows us to specify multiple conditions. Let’s reclassify the body mass variable into a new factor variable with small, medium, and large animals. In this case, we are making a continuous variable into a categorical variable.

glimpse(homerange)
## Rows: 569
## Columns: 24
## $ taxon                      <chr> "lake fishes", "river fishes", "river fishe…
## $ common.name                <chr> "american eel", "blacktail redhorse", "cent…
## $ class                      <chr> "actinopterygii", "actinopterygii", "actino…
## $ order                      <chr> "anguilliformes", "cypriniformes", "cyprini…
## $ family                     <chr> "anguillidae", "catostomidae", "cyprinidae"…
## $ genus                      <chr> "anguilla", "moxostoma", "campostoma", "cli…
## $ species                    <chr> "rostrata", "poecilura", "anomalum", "fundu…
## $ primarymethod              <chr> "telemetry", "mark-recapture", "mark-recapt…
## $ N                          <chr> "16", NA, "20", "26", "17", "5", "2", "2", …
## $ mean.mass.g                <dbl> 887.00, 562.00, 34.00, 4.00, 4.00, 3525.00,…
## $ log10.mass                 <dbl> 2.9479236, 2.7497363, 1.5314789, 0.6020600,…
## $ alternative.mass.reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mean.hra.m2                <dbl> 282750.00, 282.10, 116.11, 125.50, 87.10, 3…
## $ log10.hra                  <dbl> 5.4514026, 2.4504031, 2.0648696, 2.0986437,…
## $ hra.reference              <chr> "Minns, C. K. 1995. Allometry of home range…
## $ realm                      <chr> "aquatic", "aquatic", "aquatic", "aquatic",…
## $ thermoregulation           <chr> "ectotherm", "ectotherm", "ectotherm", "ect…
## $ locomotion                 <chr> "swimming", "swimming", "swimming", "swimmi…
## $ trophic.guild              <chr> "carnivore", "carnivore", "carnivore", "car…
## $ dimension                  <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3…
## $ preymass                   <dbl> NA, NA, NA, NA, NA, NA, 1.39, NA, NA, NA, N…
## $ log10.preymass             <dbl> NA, NA, NA, NA, NA, NA, 0.1430148, NA, NA, …
## $ PPMR                       <dbl> NA, NA, NA, NA, NA, NA, 530, NA, NA, NA, NA…
## $ prey.size.reference        <chr> NA, NA, NA, NA, NA, NA, "Brose U, et al. 20…
homerange %>% 
  select(log10.mass) %>% 
  summarize(min=min(log10.mass),
            max=max(log10.mass))
## # A tibble: 1 × 2
##      min   max
##    <dbl> <dbl>
## 1 -0.658  6.60
summary(homerange$log10.mass)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6576  1.6990  2.5185  2.5947  3.3324  6.6021
homerange %>% 
  mutate(mass_category=case_when(log10.mass<=1.75 ~ "small",
                                 log10.mass>1.75 & log10.mass <= 2.75 ~ "medium",
                                 log10.mass>2.75 ~ "large")) 
## # A tibble: 569 × 25
##    taxon        common.name class order family genus species primarymethod N    
##    <chr>        <chr>       <chr> <chr> <chr>  <chr> <chr>   <chr>         <chr>
##  1 lake fishes  american e… acti… angu… angui… angu… rostra… telemetry     16   
##  2 river fishes blacktail … acti… cypr… catos… moxo… poecil… mark-recaptu… <NA> 
##  3 river fishes central st… acti… cypr… cypri… camp… anomal… mark-recaptu… 20   
##  4 river fishes rosyside d… acti… cypr… cypri… clin… fundul… mark-recaptu… 26   
##  5 river fishes longnose d… acti… cypr… cypri… rhin… catara… mark-recaptu… 17   
##  6 river fishes muskellunge acti… esoc… esoci… esox  masqui… telemetry     5    
##  7 marine fish… pollack     acti… gadi… gadid… poll… pollac… telemetry     2    
##  8 marine fish… saithe      acti… gadi… gadid… poll… virens  telemetry     2    
##  9 marine fish… lined surg… acti… perc… acant… acan… lineat… direct obser… <NA> 
## 10 marine fish… orangespin… acti… perc… acant… naso  litura… telemetry     8    
## # ℹ 559 more rows
## # ℹ 16 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
## #   alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
## #   hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
## #   trophic.guild <chr>, dimension <dbl>, preymass <dbl>, log10.preymass <dbl>,
## #   PPMR <dbl>, prey.size.reference <chr>, mass_category <chr>

Here we check how the newly created body mass categories compare across trophic.guild.

homerange %>% 
  mutate(mass_category=case_when(log10.mass<=1.75 ~ "small",
                                 log10.mass>1.75 & log10.mass <= 2.75 ~ "medium",
                                 log10.mass>2.75 ~ "large")) %>% 
  ggplot(aes(x=mass_category, fill=trophic.guild))+
  geom_bar(position="dodge")

Practice

  1. Use case_when() to make a new column range_category that breaks down log10.hra into very small, small, medium, and large classes based on quartile.
summary(homerange$log10.hra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.523   3.653   4.595   4.709   6.016   9.550
library(gtools)
#install.packages("gtools")
quartiles <- quantcut(homerange$log10.hra)
table(quartiles)
## quartiles
## [-1.52,3.65]  (3.65,4.59]  (4.59,6.02]  (6.02,9.55] 
##          143          142          142          142
homerange %>% 
  mutate(range_category=case_when(log10.hra<3.65 ~ "very small",
                             log10.hra>=3.65 & log10.hra<=4.59 ~ "small",
                             log10.hra>=4.59 & log10.hra<=6.02 ~ "medium",
                             log10.hra>=6.02 ~ "large"))
## # A tibble: 569 × 25
##    taxon        common.name class order family genus species primarymethod N    
##    <chr>        <chr>       <chr> <chr> <chr>  <chr> <chr>   <chr>         <chr>
##  1 lake fishes  american e… acti… angu… angui… angu… rostra… telemetry     16   
##  2 river fishes blacktail … acti… cypr… catos… moxo… poecil… mark-recaptu… <NA> 
##  3 river fishes central st… acti… cypr… cypri… camp… anomal… mark-recaptu… 20   
##  4 river fishes rosyside d… acti… cypr… cypri… clin… fundul… mark-recaptu… 26   
##  5 river fishes longnose d… acti… cypr… cypri… rhin… catara… mark-recaptu… 17   
##  6 river fishes muskellunge acti… esoc… esoci… esox  masqui… telemetry     5    
##  7 marine fish… pollack     acti… gadi… gadid… poll… pollac… telemetry     2    
##  8 marine fish… saithe      acti… gadi… gadid… poll… virens  telemetry     2    
##  9 marine fish… lined surg… acti… perc… acant… acan… lineat… direct obser… <NA> 
## 10 marine fish… orangespin… acti… perc… acant… naso  litura… telemetry     8    
## # ℹ 559 more rows
## # ℹ 16 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
## #   alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
## #   hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
## #   trophic.guild <chr>, dimension <dbl>, preymass <dbl>, log10.preymass <dbl>,
## #   PPMR <dbl>, prey.size.reference <chr>, range_category <chr>
  1. Make a plot that shows how many and which taxonomic classes are represented in each range_category.
homerange %>% 
  mutate(range_category=case_when(log10.hra<3.65 ~ "very small",
                             log10.hra>=3.65 & log10.hra<=4.59 ~ "small",
                             log10.hra>=4.59 & log10.hra<=6.02 ~ "medium",
                             log10.hra>=6.02 ~ "large")) %>% 
  ggplot(aes(x=range_category, fill=class))+
  geom_bar(position = "dodge", alpha=0.6, color = "black")

  1. Isolate the small range_category and plot the range of log10.mass by taxonomic class.
homerange %>% 
    mutate(range_category=case_when(log10.hra<3.65 ~ "very small",
                             log10.hra>=3.65 & log10.hra<=4.59 ~ "small",
                             log10.hra>=4.59 & log10.hra<=6.02 ~ "medium",
                             log10.hra>=6.02 ~ "large")) %>% 
  filter(range_category=="small") %>% 
  ggplot(aes(x=class, y=log10.mass, fill=class))+
  geom_boxplot()

Lab 12.2-Data Visualization: ggplot part 3

ggplot themes

There are many options to change the theme of your plots within ggplot. Have a look [here]https://ggplot2.tidyverse.org/reference/ggtheme.html) for a list of the themes.

Let’s start by building a simple barplot.

p <- homerange %>% 
  ggplot(aes(x=taxon, fill=trophic.guild))+
  geom_bar(na.rm=T, position="dodge")

Have a look at the linedraw theme; I am adding it as another layer.

p + theme_linedraw()+
  theme(axis.text.x = element_text(angle = 60, hjust=1))+
  labs(title = "Observations by Taxon in Homerange Data",
       x = NULL,
       y= "n",
       fill= "Trophic Guild")

Legends

There are lots of options to manipulate legends. Have a look here.

p+theme_linedraw()+
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 60, hjust=1))+
  labs(title = "Observations by Taxon in Homerange Data",
       x = NULL, # removes the label from the x axis
       y= "n",
       fill= "Trophic Guild")

Not enough? Try using ggthemes

There are many packages that include additional themes, one of which is ggthemes. Some of these are nice because they are designed to mimic the look of popular publications.

#install.packages("ggthemes")
library(ggthemes)

Here is a list of the ggthemes

#ls("package:ggthemes")[grepl("theme_", ls("package:ggthemes"))]
p + 
  theme_fivethirtyeight()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 60, hjust=1))+
  labs(title = "Observations by Taxon in Homerange Data",
       x = NULL,
       y= "n",
       fill= "Trophic Guild")

RColorBrewer

The default colors used by ggplot are often uninspiring. They don’t make plots pop out in presentations or publications, and you may want to use a customized palette to make things visually consistent.

Access the help for RColorBrewer.

?RColorBrewer

The thing to notice is that there are three different color palettes: 1) sequential, 2) diverging, and 3) qualitative. Within each of these there are several selections. You can bring up the colors by using display.brewer.pal(). Specify the number of colors that you want and the palette name.

display.brewer.pal(5,"YlGnBu") #sequential palette

The R Color Brewer website is very helpful for getting an idea of the color palettes. To make things easy, use these two guidelines:

+scale_colour_brewer() is for points
+scale_fill_brewer() is for fills

Here I chose the Paired palette. Take a moment and experiment with other options.

p+scale_fill_brewer(palette = "Paired")+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 60, hjust=1))+
  labs(title = "Observations by Taxon in Homerange Data",
       x = NULL,
       y= "n",
       fill= "Trophic Guild")

Manually Setting Colors

You can also use paleteer to build a custom palette for consistency. To access the paleteer collection, I add it to a new object.

colors <- paletteer::palettes_d_names

Now we can display the palettes. Assign the palette to my_palette and then build this base R bar plot. There are a lot of options; paleteer is a collection of popular palettes. I really like the [ggsci package] (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html)

my_palette <- paletteer_d("futurevisions::atomic_orange") #Store your palette
barplot(rep(1,6), axes=FALSE, col=my_palette) # Show your palette

Now we just identify my_palette as part of scale_fill_manual()

p+scale_fill_manual(values=my_palette)+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 60, hjust=1))+
  labs(title = "Observations by Taxon in Homerange Data",
       x = NULL,
       y= "n",
       fill= "Trophic Guild")

Faceting

Faceting is one of the amazing features of ggplot. It allows us to make multi-panel plots for easy comparison. Here is a boxplot that shows the range of log10.mass by taxon.

homerange %>% 
  ggplot(aes(x=taxon, y=log10.mass))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 60, hjust=1))

There are other categorical variables that might be interesting to overlay. facet_wrap() makes a ribbon of panels by a specified categorical variable and allows you to control how you want them arranged.

homerange %>% 
  ggplot(aes(x=taxon, y=log10.mass))+
  geom_boxplot()+
  facet_wrap(~trophic.guild)+
  theme(axis.text.x = element_text(angle = 60, hjust=1))

facet_grid() allows control over the faceted variable; it can be arranged in rows or columns. rows~columns.

homerange %>% 
  ggplot(aes(x=taxon, y=log10.mass))+
  geom_boxplot()+
  facet_grid(trophic.guild~.)+
  theme(axis.text.x = element_text(angle = 60, hjust=1))

facet_grid() will also allow the comparison of two categorical variables, just remember a~b where a is rows and b is columns.

homerange %>% 
  ggplot(aes(x=taxon, y=log10.mass))+
  geom_boxplot()+
  facet_grid(trophic.guild~thermoregulation)+
  theme(axis.text.x = element_text(angle = 60, hjust=1))